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Spectral clustering uses the global information embedded in eigenvectors of an inter-item simi- 
larity matrix to correctly identify clusters of irregular shape, an ability lacking in commonly used 
approaches such as fc-means and agglomerative clustering. However, traditional spectral clustering 
partitions items into hard clusters, and the ability to instead generate fuzzy item assignments would 
be advantageous for the growing class of domains in which cluster overlap and uncertainty are im- 
portant. Korenblum and Shalloway [Phys. Rev. E 67, 056704 (2003)] extended spectral clustering 
to fuzzy clustering by introducing the principle of uncertainty minimization. However, this posed 
a challenging non-convex global optimization problem that they solved by a brute-force technique 
unlikely to scale to data sets having more than O(10 2 ) items. Here we develop a new method 
for solving the minimization problem, which can handle data sets at least two orders of magnitude 
larger. In doing so, we elucidate the underlying structure of uncertainty minimization using multiple 
geometric representations. This enables us to show how fuzzy spectral clustering using uncertainty 
minimization is related to and generalizes clustering motivated by perturbative analysis of almost- 
block-diagonal matrices. Uncertainty minimization can be applied to a wide variety of existing hard 
spectral clustering approaches, thus transforming them to fuzzy methods. 



I. INTRODUCTION 

Coarse-graining data items i (1 < i < N) into clusters 
ct (1 < a < m) is important for large-scale data analy- 
sis [H-IH . For example, clustering genes according to their 
microarray expression profiles allows biologists to subse- 
quently infer potential cis-regulatory elements from se- 
quence commonalities within the clusters Q. Clustering 
typically proceeds from a symmetric N x N similarity 
matrix S, where the non-negative off-diagonal element 
Sij provides an inverse indicator of the "distance" dij 
between items i and j. The primary input (e.g., the align- 
ment scores from sequence comparisons or edge weights 
of a graph) may directly define the <Sy . Alternatively, the 
data may consist of Np properties for each item that can 
be embedded in a dataspace. For example, in microarray 
analysis each gene is an item, and its properties are its 
Njj expression levels under Np different conditions. In 
that case, the dij are derived from the (not-necessarily 
Euclidean) distances between the items in the dataspace. 

Spectral clustering methods ([H, [|| for history and re- 
view) analyze the eigensystem of a transition (or Lapla- 
cian) matrix T, which is derived from S. Since the eigen- 
system depends globally on the entire data set, spec- 
tral methods have a perspective lacking in commonly 
used methods such as fc-means and agglomerative cluster- 
ing Q , which directly analyze the Sij . Their dependence 
on pairwise similarities leads them to impose character- 
istic cluster shapes; e.g., fe-means and complete-linkage 
clustering generate convex clusters while single-linkage 
clustering generates unbalanced and straggly clusters [2| . 
These shapes may not reflect the true geometries of the 
problem, such as the irregular boundaries of a subject 



within an image Q ■ The ability of spectral methods to 
generate arbitrary cluster shapes lets them outperform 
fc-means across several benchmarks And as we 

will see, they can also determine the optimal number of 
clusters automatically. 

T typically satisfies [H 
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where D-^ is a diagonal normalizing matrix with non- 
negative elements satisfying Tr(Z?7r) = 1, 1 is the item- 
space vector having all components equal to one, and ■ 
denotes the normalized item-space inner product: 



N 

xy = N^^Xiyi 



These conditions emerge when spectral clustering meth- 
ods are used t o app roximate "min-cut" graph partition- 
ing solutions |13L Il4| or when the y a re motivated by 
discrete- |15l - ll9f or continuous-time [lfj dynamical mod- 
els. [The first two motivations lead to analysis of the 
Markov matrix T = I — T (where / is the identity ma- 
trix), which satisfies 1 • T = 1 rather than Eq. (|ld|). But 
since the eigenvectors of T and T are identical and the 
eigenvalues are simply related, the same analysis applies 
with inconsequential changes.] 
Eqs. |T]) imply 



'Electronic address: bsw27@cornell.edu 
T Electronic address: dis2@cornell.edu 



70 





Nn 
1 

D- 1 ■ xb R 



(2a) 
(2b) 
(2c) 
(2d) 



2 



where and i/^ are the bi-orthogonal left and right 
eigenvectors of T, which we normalize such that xjj^ ■ 
ipn — &mn and tp^-xjj^ = 1, and 7r is the right equilibrium 
probability vector satisfying 7r^ = 1 . It follows that 
(D~ 1 )u = tt^~ . Eqs. (p} also imply that TijiVj = Tjtni 
(i.e., that detailed balance holds), which ensures the re- 
ality and non- negativity of the eigenvalues (2pj . 

Spectral methods begin by embedding each item i into 
the low-frequency (or clustering) subspace R m using as 
coordinates the m low-frequency vector components of 

~^ L {i) = ft/>o 0), i>i(i) ■ ■ ■ EH- These arc then 

used to identify m clusters [221] . Clustering (i.e., spatial 
coarse-graining) is possible only if there is a gap in the 
distribution of the similarities Sy [23| . 

The dynamical interpretation of spectral clustering 
provides a way to find a gap if it exists: Each cluster 
is viewed as a metastable state of a diffusive relaxation 
process governed by T [24| 
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where p(t) is a time-dependent probability vector over 
the discrete space of items [i.e., Pi(t) is the probability 
of occupation of item i at time t], —Tij is the stochastic 
transition rate from item j to i, and Eqs. (| lc[) and (|ld|) 
ensure that probability is conserved. Because of the in- 
verse relationship between eigenvector "wavelength" and 
eigenvalue, a spatial-scale gap in the distribution of the 
Sij will appear as a time-scale spectral gap: 



= 7o < 7i < ■■• < 7m-i < In 



(4) 



The gap between 7 TO _i and 7 TO indicates the existence 
of m clusters. When a spectral gap exists, the long- 
wavelength, clustering eigenvectors ipjizm wn l contain 
the information needed for clustering |25j. 

For example, Fig. [1] illustrates the m = 3 "spiral" clus- 
tering problem posed by 77 items embedded in a two- 
dimensional dataspace and the corresponding eigensys- 
tem of the T matrix of Rcf. [HI [see Eqs. (JMH below]. 
Panel (a) shows the spatial locations of the items, and it 
is subjectively evident that there are three interlocking 
clusters. Correspondingly, as predicted by Eq. ((4]), there 
is a gap between 72 and 73 [panel (c)]. The clustering 
eigenvectors, [panel (d)] and i/>f [panel (e)], vary sig- 
nificantly only at the cluster boundaries and follow their 
distorted shapes. Thus, the shapes of the clusters defined 
using these eigenvectors will not be artificially restricted. 
In contrast, the non-clustering eigenvectors such as ip^ 
[panel (f)] have large variations within clusters and thus 
are not used in the clustering analysis. 

It remains to define the clustering from the cluster- 
ing eigenvectors. Hard spectral clustering approaches do 
so simply by applying non-spectral methods such as k- 
means within the clustering subspace @. However, there 
are problems where hard partitioning is neither necessary 



nor ideal, for example, the separation of cell subpopula- 
tions by fluorescence activated cell sorting (FACS) [2r| . 
automated biolo gica l database curation [27|, complex 
network analysis 28 1, and gene expression analysis 29} . 
Such problems require fuzzy clustering that can represent 
uncertainty and overlapping clusters. 

Non-spectral fuzzy clustering methods have already 
been applied to such problems [281 . 29], but spectral fuzzy 
methods could be advantageous because of their added 
ability to cope with irregular cluster boundaries (such as 
those within FACS dataspaces (26|). Moreover, fuzziness 
could provide further benefit even in areas where hard 
spectral clustering has already been applied. For exam- 
ple, Paccanaro et al. have used hard spectral cluster- 
ing to faithfully reproduce many of the superfamily classi- 
fications from a subset of the SCOP protein database [30| : 
a fuzzy spectral approach would add the ability to assess 
the certainty of such classifications. 

Formally, fuzzy clusterings are described by assign- 
ment vectors w a = [w a (l), w a (2), . . . , w a (N)], where 
w a (i) is the probability that item i is a member of cluster 
a, and therefore must satisfy the probabilistic constraints 



w a {i) > (V a, i) 

5> a (i) = 1 (Vi)- 
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To define these in a spectral context, following Ref. [10| 
we use the low-frequency clustering eigenvectors as a lin- 
ear basis for the w a [311 ] : 
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where the M a = [M a o, M a \, . . . M Q ( m _i)] are m- 

vectors, t/v^ = [i/>q , , . . . , t^m-ili and o denotes the 
inner product over the low-frequency subspace: 
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Eq. ((6]) transforms the clustering problem to that of 
finding the "best" M a subject to Eqs. ([5]). Korcnblum 
and Shalloway [l(| proposed that this was the one that 
minimized overlap between assignment vectors: Since the 
w a are non-negative and composed of only the long- 
wavelength ipn< m > they will inevitably overlap each other 
and thus will give uncertain (i.e., fuzzy) item-to-clustcr 
assignments. This uncertainty is minimized when the 
clusters' self-overlap is maximized. The self-overlap (of 
cluster a) can be quantified by the fractional cluster cer- 
tainty T Q (M) (1 < a < m) 0, 
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(N- 1 < T a (M) < 1) , (7) 
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FIG. 1: The "spiral" clustering problem and its eigensystem. (a) The two-dimensional embedding of the spiral data set in 
dataspace. (b), (d), (e), (f) The (positive or negative) heights of the cones indicate the values of the clustering eigenvectors 
•0Oi 0i i alm 02 i an d °f the first non-clustering eigenvector, 03 of the data set's T. (c) The corresponding eigenvalues. Unless 
otherwise noted, figures are based on the T matrix defined by Korenblum and Shalloway [l(J [see Eqs. ((24)) ]. 



where M represents the components of all the M a and 
bra-ket notation denotes the equilibrium-weighted inner 
product [HI 

{x\y) = x-D n -y. (8) 

T a (M) = 1 when the cluster a is completely certain, 
i.e., w a (i) = or 1; the total certainty is the product 
of the T Q (M) for all the clusters. Thus, the optimal M 
is determined by uncertainty minimization of the overall 
uncertainty objective function, 

$(M) = -J>gY a (M), ( 9 ) 

a 

subject to the constraints of Eqs. Korenblum and 
Shalloway showed that this procedure provided good 
fuzzy clusterings of a number of difficult problems. How- 
ever, they solved the resulting challenging constrained, 
non-convex uncertainty minimization problem using a 
"brute-force" solver whose 0(m 2 N m+1 ) computational 
complexity limited its application to modest-sized prob- 
lems (N — 200) and precluded application to the larger 
problems [e.g., N ~ Q(1Q 4 )1 that emerge in areas such as 
gene microarray analysis |33| . 



A closely related approach was independently devel- 
oped by Weber et al. [HI]. They also used Eq. J6]), but, in- 
stead of using uncertainty minimization, determined the 
M through an efficient, but approximate, method moti- 
vated by perturbative analysis of almost-block-diagonal 
matrices [34|. Their Perron Cluster Cluster Analysis 
(PCCA) defined the w a as "membership functions" that 
only approximate the probabilistic constraints of Eqs. 
([5]). In PCCA the M are determined algorithmically 
rather than by objective function optimization, and clus- 
terings for different values of m are accepted if the resul- 
tant approximation is regarded (by subjective criteria) 
to be adequate. While approximate, this method had 
the advantage of being computationally simpler than the 
initial uncertainty minimization algorithm of Korenblum 
and Shalloway |10j . 

Thus until now, practical, exact fuzzy spectral data 
clustering has remained elusive. To resolve this prob- 
lem, here we develop an efficient method for uncertainty 
minimization and show that it is generally applicable to 
any spectral clustering method satisfying Eqs. (P), in- 
cluding popular asymmetric approaches based on ran- 
dom walks over graphs fl5l4l9l ]. Thus, we imbue a wide 
range of hard spectral clustering methods with the abil- 
ity to represent fuzzy cluster assignments and, thereby, 
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uncertainty and cluster overlap. In the process, we show 
that there arc multiple geometric interpretations of the 
uncertainty minimization problem that can be used to 
illuminate its structure. Through these we relate uncer- 
tainty minimization to PCCA and extend the previously 
reported conditions under which the PCCA approxima- 
tion is applicable. 



II. COMPUTATIONAL THEORY 

Minimization of $(M) subject to the constraints of 
Eqs. (0 poses a global, non-linear optimization problem 
in the to 2 degrees of freedom of A/. To solve this it is 
convenient to recxpress Eq. ^ explicitly in terms of the 

M a as 

S(M) ^ - ^ log T a (M) = - J2 l"g ^ ° ^" . (10) 

a a M a O Eq 

where £o is the m-vector (1,0,..., 0), and we have used 
(w a \w a ) = M a oM a and (l|u; Q ) = Ai Q oe , which follow 
from Eqs. (|2c|) . (|2d[) , and ([6]) and the bi-orthogonality of 
the eigenvectors. Similarly, we reexpress Eqs. ([5|) in terms 
of the M a : 



w a (i) = M a o > 

^ M Q = £0 ■ 



(Va,i) 
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Because <&(M) is invariant under permutations of the in- 
dices associated with the clusters, its global minimum 
will have an ml-fold permutation degeneracy. 

We now describe two geometric representations that 
illuminate the problem (Sec. Ill A[) and then show how to 
solve it in three steps: (1) precondition T to avoid numer- 
ical noise that can obfuscate spectral gaps when low-lying 
eigenvalues are nearly degenerate, to improve numerical 
efficiency, and to remove outliers ( Appendix [XJ, (2) find a 
zeroth-order solution (Sec. Ill B"|) . and (3) iteratively refine 
using linear programming with a subset of the inequal- 
ity constraints to determine the solution to the desired 
accuracy (Sec. Ill C|) . Since the procedure explicitly uses 
only the ip^ > f°r notational convenience we subsequently 
denote them simply as the ip n . 



Geometric representations of uncertainty 
minimization 



Symmetric M -representation 



-ri 



Each M a may be regarded as the coordinates of a 
particle a in M. m with axes labeled X , X\, . . . ,Xt m _\y 
Eq. (|lla[) implies that the same N inequality constraints 
act on each particle; thus they restrict each one to the 



same half-space in R m bounded by a hypersurface pass- 
ing through the origin and normal to ip(i). The inter- 
section of these half-spaces determines the feasible re- 
gion as a convex polyhedral cone in the upper half of 
W n . Only a subset of the inequality constraints will ac- 
tually bound the feasible region, since their satisfaction 
will automatically guarantee satisfaction of the other con- 
straints. And, as proved in Appendix IB II each particle 
lies on an edge of the polyhedral cone (i.e., is constrained 
by to — 1 active inequality constraints) at every local min- 
imizcr of $(M). 

An example of this symmetric M -representation for 
an to = 2 problem (based on the "crescentric" bivariatc 
data set of Ref. Q) is shown in Fig. 2(a) (It is only in 
the m = 2 case that a simple graphical representation is 
possible; nonetheless it is useful for illustrating structural 
properties that also hold when to > 2.) In this case, the 
feasible region is bounded by only two lines corresponding 
to x"o ip (i < ) = and A^o ip (i>) = 0, where z< and z> are 
the minimizer and maximizer of respectively. The 

global minimum of $ corresponds to the unique (up to 
the permutation degeneracy) situation where each parti- 
cle lies on the feasible region boundary while the equality 
constraints of Eq. (|llb[) are simultaneously satisfied. In 
Fig. 2(a) this is when the points are located at the two 
squares on the boundary. The two ways of associating 
the particles with the squares corresponds to the 2-fold 
permutation degeneracy of the solution. 



2. Asymmetric M -representation 

The to particles in the symmetric M-represcntation are 
not independent because of the equality constraints [Eq. 
(| 1 lb[) ] - We use these in the asymmetric M -representation 
to explicitly eliminate the degrees of freedom of one slave 
particle that, without loss of generality, we take to be 



M r 



£o 



(12) 



The homogeneous inequality constraints on the slave, 

Mm ° ip (i) > (Vi), transform into inhomogeneous in- 
equality constraints that couple the remaining to — 1 free 
particles: 



(13) 



We consolidate the to(to — 1) degrees of freedom of the 
free particles into the supervector A2 froc having compo- 
nents (Mi, M 2 ,...,M m _i) in R m ( m_1 ). Optimization 

then proceeds in R" 1 '" 1 " 1 ) with the At free restricted by 
(to — 1)N homogeneous inequality constraints from Eq. 
(|llap with a < to and N inhomogeneous inequality con- 
straints from Eq. (|13p . The combination of homogeneous 
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FIG. 2: Symmetric and asymmetric M-representations of 
the m — 2 "crescentric" problem 0, E3l- ( a ) Symmetric M- 
representation: The diagonal lines indicate the boundaries 
formed by the inequality constraints. The two bold lines 
forming the narrowest cone (shaded) define the feasible re- 
gion in R m = R 2 . Mi and M2 are represented by dots. They 
are not independent since they are further constrained by the 
equality constraints of Eq. ()llb|l . The global minimum of 
the uncertainty objective function "l>(Af) corresponds to the 
dots being located at the positions indicated by small squares, 
and the invariance under particle exchange corresponds to the 
permutation degeneracy discussed in the text, (b) Asymmet- 
ric M -representation: The solid lines indicate the boundaries 
of the homogeneous inequality constraints acting on the free 
particle A/ frcc = M%. The dashed lines indicate the bound- 
aries of the inhomogeneous inequality constraints that de- 
rive from the slave particle A22. Two of these (bold-dashed) 
lines cap the cone formed by the relevant homogeneous con- 
straint boundaries (bold) to define a closed feasible polytope 
in R m < m ~ 1 ) = R 2 . I n this representation the single dot rep- 
resents all m(m — 1) = 2 components of M iree . $(M) is min- 
imized at either of the two permutation-degenerate solutions 
(small squares). 



and inhomogeneous inequality constraints forms a closed 
convex polytope that bounds the feasible region. Each 
local minimum of &(AI) (and thus, the global minimum) 
lies at a vertex of this polytope ■ 

An example of the asymmetric M-representation for 
to = 2 is shown in Fig. |2(b)| In this case there are 
four bounding constraints: two homogeneous inequality 
constraints having boundaries passing through the origin 
and two inhomogeneous inequality constraints (from the 
slave cluster) with boundaries intersecting at io (HE]- $ 
is infinite at the polytope vertices at the origin and £0. 
The two other vertices correspond to index-permutation- 
cquivalcnt global minima. 

The minimization problem can be visualized and eas- 
ily solved in this manner only for to — 2: As m increases 
the number of polytope vertices, and hence the number 
of local minima, grows rapidly, and the global minimiza- 
tion problem becomes difficult. Korenblum and Shal- 
loway jToj solved this by an expensive, random explo- 
ration of the vertices. 



B. Cluster representatives and the approximate 
global solution 

1. Representatives 

We take a different approach: Rather than trying to 
identify the minimizing vertex directly, we exploit the 
fact that the to 2 components of M can be determined by 
the to 2 low-frequency components of an appropriately 
chosen subset 1Z = {n, r%, . . . , r m } of m items, which 
we call representatives. To make this explicit we write a 
matrix analog of Eq. © over 1Z as 



W n = M o m n 



(14) 



where 



K — 



n — 



W a {rj3) 

4>n(r a ) 



(1 < a,P < m) 

(1 < a < to) 
(0 < n < to) 



and M is the matrix having the M a as its rows. Accord- 
ing to Eq. (|llb[) . M must satisfy 



(16) 



As shown in Appendix IB 21 there always exists at least 
one subset 1Z such that ^> n is invertible. With such a 
subset we can solve Eq. f| 14[) for M: 



M 



W n • (* 



K\-l 



(17) 



where • denotes the inner product over the cluster index 
a. 

The usefulness of Eq. (jTTJ) may be questioned since a 
priori we do not know any exactly. However, any 
data set amenable to clustering will have at least one item 
per cluster that will be strongly assigned in the clustering 
solution; we call such items candidate representatives. If 
we could select a set of representatives 1Z C containing 
one candidate representative from each cluster, we could 
use our approximate foreknowledge of their assignment 
values at the solution, W n, = , to approximate M at the 
solution, M*, via Eq. (fT7]) . 

For example, if item i a were a candidate representative 
for cluster a, its assignment in the clustering solution 
would be [3a| 



W*p(ia) ~ 0~af: 



(18) 



By choosing r a = i a and making similar choices for the 
other clusters, we would get 



W 

This zeroth-order estimate could be used to approxi- 
mately solve Eq. (fl7|) for M*: 



M* 



w 



Tic 



n a \-i 

1 = ( $ Kc)-l 



(19a) 
(19b) 
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In agreement with Eq. (fT6|) . M° would satisfy [37} 

E M °« = *«°- ( 2 °) 

a 

Knowing M° would allow us to define zcroth-ordcr cs- 

— i — y 

timatcs for all the items via Eq. ([6]) with M a = M° a , 
where the M° are the rows of M°: 

w° a = A/0 o ^ . (21) 

However, the 10° would not necessarily satisfy the in- 
equality constraints of Eq. (|5a|) . If they did, they would 
solve the optimization problem (see Sec. Ill C ip . If they 
didn't, they would provide a starting point for refining 
the solution as discussed in Sec. Ill CI 



2. Finding 1Z C 

Eq. (|19b|) implies that we only need to find the repre- 
sentatives to determine M°. This is trivial when m = 2: 
The two active inequality constraints [identified by ei- 
ther pair of intersecting bold and bold-dashed lines in 
Fig. |2(bj] come from the extremal items r\ and T2 of xpi, 
i.e., the minimizer and maximizer of if>i(i). Thus, at the 
solution w\(r2) = and u^^i) = 0, and the equality 
constraints imply that w*(r{) = 1 and u^ 7 ^) = 1: ?~i 
and r-i not only generate the active constraints, but are 
also the representatives, which in this case are perfectly 
assigned in the solution. 

The situation is more complicated when m > 2. The 
representatives: (1) may not be maxima and minima of 
the eigenvectors, (2) may not be the items associated 
with the active constraints, and (3) may not be perfectly 
assigned at the solution. Nonetheless, as discussed above, 
they will satisfy w a (rp) ~ 8 a p and we will use this prop- 
erty to identify them. 

We show how this is done using the m = 3 spiral prob- 
lem as an example (Fig. [3]). Its three low-frequency clus- 
tering eigenvectors are shown in panel (b) , and the repre- 
sentatives that we would like to find arc identified by cir- 
cles, triangles, and squares. To find 1Z C we imagine that 
we know M* and the corresponding assignment vectors 
•tu* so that we can map the items into M. m at the points 
specified by the 3- vectors w*(i) = [wl(i),W2(i),w^(i)] in 
panel (c) [38[. Because the w*{i) satisfy the probabilistic 
equality constraints, these points lie in the 2-dimcnsional 
plane that is normal to the vector (1, 1, 1) and at distance 
1/vo from the origin. Moreover, they satisfy the proba- 
bilistic inequality constraints and thus lie within an equi- 
lateral triangle in this plane. (We use "within" to include 
points that lie on the boundary.) This provides barycen- 
tric coordinates [3!| in which the three vertices of the 
triangle correspond to the cluster assignments (1,0,0), 
(0, 1, 0), and (0, 0, 1); we will call these the a = 1, 2, and 
3 vertices, respectively. The three components of w*{i) 



are given by the three distances of point i from the three 
sides of the triangle. Thus, if point i lies on the side of 
the triangle opposing vertex a, the inequality constraint 
w a (i) > is active. We call this the w) A -representation 
[panel (c)]. Although it may not be evident in the figure, 
consistent with the even distribution of active inequality 
constraints between the clusters (Appendix IB 1[) . each 
side of the triangle intersects exactly two items. 

The candidate representatives are the items that are 
close to the three vertices, and we want to choose one 
from the vicinity of each vertex to compose 7Z C . We can 
do this by choosing the three items that (when taken as 
vertices) define the triangle of largest area. It is easy 
to show that the triangular area defined by any subset 
1Z of three items located at their solution positions is 
|W K |/(2 V3). Thus, we can find a good 7Z C by finding 
the subset TZ that maximizes \W n |. 

Since we don't actually know M* or the w*(i), it is not 
obvious how to proceed. However, Eq. (fH|) implies that 

\W n '\ = \M*\ \V n \ , (22) 

so, since M* is fixed (though unknown), selecting the TZ 
that maximizes {W 71 | is equivalent to selecting the TZ 
that maximizes |\I> 7?, |. This is straightforward because 
ty n does not depend on M. Formally, maximizing \^> n \ 
is a combinatoric problem that could be solved by com- 
paring the determinants for all subsets 1Z. However, this 
would be exponentially expensive in N. Instead we use 
an efficient greedy algorithm that selects the representa- 
tives solely from the subset of candidate representatives. 
This may not exactly maximize the determinant, but will 
be adequate to determine an 1Z C that gives, via Eq. (|19b[) . 
an M° that can be used as a starting point for refinement. 

We leave the details of the greedy algorithm to Ap- 
pendix[Cj but it is useful to establish its geometric frame- 
work here, continuing to use the spiral problem as an ex- 
ample: We first plot each item in the 2-dimensional ip 1 -- 

representation using the 2-vector ip {%) = [ipi(i),ip2(i)] 
[panels (d) and (f)]. [No information is lost in this pro- 
jection from the low- frequency subspace since ipo{i) = 
1 (V i).] These vectors are independent of M rather, 
in this representation M determines the position of the 
inequality constraint bounding triangle. As explained in 
Appendix IB 4| the ip coordinates of the three bounding 
triangle vertices are the columns of the bottom two rows 
of M~ x . When M = M* [panel (d)], the vertices may not 
coincide with any items, but all the items will lie within 
the bounding triangle. When M = M° [panel (f)], the 
vertices of the triangle coincide with the representatives, 
but some items may violate the inequality constraints 
and lie outside the triangle. (Four items in the upper 
left corner are outside the triangle in this example.) The 
greedy algorithm operates within the ip ^-representation 
to identify 7Z C . 

The approach generalizes easily to higher m: The 
W*(i) are now m- vectors. The u) A -representation is in 
an (to — l)-dimcnsional hypcrplane normal to the vec- 
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FIG. 3: w - and t/; ^representations of the spiral problem. The items are represented in the dataspace as peaks with 
magnitudes determined by their maximal assignment (a), in the clustering (low- frequency) eigenvector representation (b), in 
the barycentric coordinates of the w A -representation (c) and (e), or in the x -representation (d) and (f). Panels (c) and (d) 
correspond to the refined solution M* , while (e) and (f) correspond to the zeroth-order solution M . The solid and dotted 
triangles denote the M* and M feasible region boundaries. [The solid triangle is superimposed in panel (f) to show how the 
triangle expands slightly in the ip ^-representation during refinement. The arrow indicates the item that becomes an active 
constraint in M* .] The left and right arrows connecting the representations are reminders that M determines the positions of 
the items in the w) A -representation and of the triangle vertices in the t/> x -representation. [Although it may not be evident in 
the figure, the points in panels (c) and (e) and the top vertex in panels (d) and (f) are at slightly different positions.] The 
different shades of gray in panel (a) denote the hard clustering obtained by quantizing the fuzzy clustering, while the height 
of a cone shows the strength of the probabilistic assignment of the item to the cluster. The ordering of items in panel (b) was 
chosen post facto to separate the clusters. The dashed lines in this panel are at ip n (i) = 0. The representatives for clusters 1, 
2, or 3 are enclosed within triangles, circles, or squares, respectively. 
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tor (1,1,..., 1) in R m and provides barycentric coor- 
dinates for the w*(i). 7Z C is comprised of the subset 
of m items that, when located at their solution posi- 
tions in the w A -representation, are the vertices of the 
(m — l)-simplex of largest hypervolume. This hypervol- 
ume, for any subset 1Z, is proportional to \ W n | so, via 
Eq. (f2"2")) . we can transform the problem of selecting 7Z C 
to that of finding the the m items that maximize Ivf^l. 
This problem is equivalent to maximizing the hypervol- 
ume of the ip -'--representation simplex having vertices at 
{%> {€) : i G 7Z}. Once 1Z C has been identified, it is 
used to determine M° via Eq. (|19b|) . and M° is used to 
determine via Eq. ([21]) . 



C. Refinement 



1. Case when M is the exact solution 



is the gradient of $(Af) with respect to M a . Local min- 
imization using this linear approximation and the con- 
straints of Eqs. (fTT|) pose a linear programming (LP) 
problem, which can be solved by standard methods. 

A simple approach would be to: (1) apply LP using 
Eq. (|23p and all the constraints to find an improved, 
constraint-satisfying solution M 1 , (2) set M° <— A/ 1 , 
and (3) repeat (1) and (2) until sufficient convergence is 
achieved. This amounts to constrained gradient-descent 
local minimization. However, we do not expect to en- 
counter the slow convergence problems that sometimes 
plague gradient descent because all the LP solutions, as 
well as the true solution, are at vertices of the feasible 
polytope 41] . Therefore, even the first iteration will 
drive the solution to a vertex, and the solution will not 
change at the next iteration unless the vertices are very 
dense on the scale set by the curvature of <I>(M). Thus, 
rapid convergence is expected. 



If the satisfy all the inequality constraints, they 
provide the unique solution to the uncertainty mini- 
mization problem. To prove this, consider the ip J_ - 
representation of an m = 3 problem where the inequality 
constraints are satisfied. As in the spiral problem, the 
representatives are at the vertices of the ip - 1 triangle de- 
termined by M , and as discussed above, transforming 
M° to M moves the sides of this triangle. Moving any 
side inwards would leave a representative outside the tri- 
angle, thus violating an inequality constraint. And, since 
all points are already within the triangle (i.e., all inequal- 
ity constraints are satisfied), moving any side outwards 
would result in that side contacting less than two points, 
i.e., one of the clusters would have less than the required 
(Appendix IB lj) m — 1 = 2 active inequality constraints. 
Therefore, in this case M* = M° must be the unique 
solution. As can be inferred from the analysis of Fig. [2l 
M° is always the unique solution for m — 2 problems. 



2. Linearizing <E>(A/) 

If the 10° violate any of the inequality constraints, M 
is not a solution but can be used as the starting point for 
further refinement. Since it is expected to be near M*, 
we can expand the objective function in its neighborhood 
to first-order as 



$(M) = log 



aI q o m q 



M a o e 

*(M°) + til) o ^(M) jf (23) 

z — ' V / M=M° 



where 



=s — — — 



5M r 



|M Q | 2 M a oi 



3. Reducing the number of constraints included in LP 

However, the cost of standard LP solvers (e.g., simplex 
and interior point methods) grows rapidly [0(N^ 5 )] with 
the number of constraints N c , which may be large (42j. 
While there are mN inequality constraints, only m(m — 
1) of these are active at M* . These alone need to be 
included in the LP problem to guarantee that all the 
inequality constraints will be satisfied. Since we will often 
be interested in problems where m ~ 0(10) and TV ~ 
O(10 4 ), it would accelerate the LP solver by multiple 
orders of magnitude if the number of constraints provided 
to it were reduced to 0(m 2 ). 

We do not know the active constraints a priori, but 
can find them rapidly by an iterative procedure that ex- 
ploits the fact that (as discussed above) at M* exactly 
m, — 1 points will lie on each of the m faces of the bound- 
ing simplex in the rp ^-representation. To motivate this 
procedure, consider the refinement of the spiral problem 
(Fig. [3]). The left side of the (dotted) M° triangle [panel 
(f )] must move outwards to include the four points in the 
upper left region that are excluded from its interior; this 
motion must leave the side intersecting two points. Be- 
cause the objective function $(Af ) constitutes an inward 
"pressure" on the triangle, M* will correspond to the sit- 
uation where the smallest expansion that can accomplish 
this is used. Consequently, the left side will pivot out- 
wards about the lower left corner until it intersects the 
item identified by the arrow. Each side of the resulting 
M* triangle [panel (d)] will intersect m — 1 = 2 points, 
and these points will be near (but not identical with) the 
m = 3 vertices of the M° triangle. These six intersections 
will identify the m(m — 1) = 6 active constraints. 

This suggests that, for m = 3 in general, the two points 
lying on a side of the M* triangle will be near different 
vertices and, subject to this restriction, will be the points 
that are farthest outside the M° triangle. This easily 
generalizes to m > 3: Each of the m faces of the M* 
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simplex will contain m — 1 item points, each near a dif- 
ferent vertex. These m(m — 1) points are the most likely 
to lie outside the M° simplex. Thus, it is sensible to 
initially attempt a LP solution using only the inequality 
constraints corresponding to these m(m — 1) face-item 
point pairs. [If point i lies on the face opposing vertex a, 
this face-item pair corresponds to the active inequality 
constraint w a (i) = 0.] However, this is only a heuristic 
argument, and inequality constraints may still be vio- 
lated in the partially constrained LP solution. If so, we 
iterate while adding to an included constraint list C (of 
face-item pairs) the violated constraints that are iden- 
tified by the above criteria as most likely to be active. 
The procedure terminates when all the inequality con- 
straints are satisfied. Termination is guaranteed because 
inequality constraints are only added to, and never re- 
moved from, the included constraint list. The procedure 
is formalized below. 



jf.. Refinement Algorithm 

1. Initialize C to the empty set. 

2. Perform hard clustering based on the M° assign- 
ments: Item i is assigned to the cluster (vertex) a 
that maximizes w^(i). We call this subset of items 
S a . 

3. Identify the item (designated ip) from Sp (f3 ^ a) 
that is farthest outside the face opposing vertex a. 
This identifies the m — 1 constraints corresponding 
to the face-item pairs (a,ip : j3 ^ a). As shown 
in Appendix IB 5[ the ordering of the item points 
relative to the simplex faces is the same in the w A - 
and i/j ^-representations. Therefore, wc determine 
the ordering in the w A -representation barycentric 
coordinates since this is simple: w a (i) is the dis- 
tance of an item point i from the a-opposing face 
(positive if inside, negative if outside the simplex). 
When executed for all m faces this procedure iden- 
tifies m(m — 1) inequality constraints C. 

4. C<-Cl)C. 

5. Apply the LP solver with the equality constraints, 
the inequality constraints in C, and the linear ob- 
jective function approximation of Eq. f)23|) . 



6. Check for satisfaction of all inequality constraints 
and for convergence according to max^ i — 
^aWI < /°LP: where plp is a small number, and 
w^(i) and w^(i) are the values determined by 
M 1 and M°, respectively. If both conditions are 
satisfied, terminate with M* = M ; if not, set 
M° <- M 1 and return to step 2. 

When the algorithm is applied to the spiral problem, 
C is set to the active constraints in a single step [43j . 



III. OVERALL COMPUTATIONAL 
ALGORITHM 

Combining the steps described in Sec. HH the overall 
algorithm is: 

1. Compute and precondition T as described in Ap- 
pendix 



2. Compute 20 [44[ low-frequency clustering eigenval- 
ues and eigenvectors using the Lanczos method [45[ . 
This is more efficient than computing the full eigen- 
system, but will converge slowly if the eigenvalues 
are densely-packed near zero (as they often are). 
To exclude this possibility we employ a shift-and- 
invert spectral transformation [46|, which spreads 
out the small eigenvalues by transforming them into 
the large magnitude eigenvalues of a related spec- 
tral decomposition having the same eigenvectors. 

3. Following Ref. [13 , determine m according to the 
lowest spectral gap satisfying ^ m Hm-\ > P~t> 
where p 7 is the minimum gap parameter. If there 
is no gap, the algorithm has determined that there 
are no clusters and terminates. 

4. Identify the representatives and compute the 
zcroth-order solution M° and using the pro- 
cedure of Sec. IIIBI 

5. Determine if violates any inequality constraints. 
If so, iteratively refine M° to M* using the pro- 
cedure of Sec. Ill Cl and, via Eq. ((6|), compute the 
refined solution id*. Otherwise, iu* = u>°. 

6. Following Ref. test the solution against the 
minimum certainty conditions T a (M) > pr (Vq), 
where py is the minimum certainty parameter. If 
it satisfies them, the solution is accepted. If not, 
the eigenspectrum can be tested for higher spec- 
tral gaps, and the algorithm proceeds with step 4. 
If desired, the fuzzy solution can be quantized to 
a hard clustering by assigning item each i to the 
cluster having the largest assignment value; these 
hard clusters may be recursively analyzed. 



IV. RESULTS 

We tested the efficiency of our method for uncertainty 
minimization by using it for fuzzy spectral clustering of 
a family of synthetic data sets containing up to N = 
20, 000 items. Further, we showed that it can be applied 
to both symmetric and asymmetric T matrices popular 
in the literature. 
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A. Implementation 

The CH — h implementation was compiled using gcc 
version 4.1.2 and g77 version 3.3.5 under -03 opti- 



47 routines 



mization. It accesses low-level LAPACK 
through LAPACK++ [48[ version 2.5.2, interfaces to the 
ARPACK [H Lanczos solver through the ARPACK++ 
CH — h wrappers [50l [. and solves constrained linear pro- 
grams using the GLPK simplex method [5l[ version 4.9. 
The scaling benchmarks of Sec. IIVBI were executed on 
a dedicated quad CPU 3.46 GHz Pentium 4, configured 
with 4 GB of RAM and 4 GB of swap space, and running 
a 64-bit version of SuSE Linux. The numerical precision 
parameter was e = 2.22045 x 10~ 16 . The minimum gap 
and minimum certainty parameters were set to p~ ( = 3 
and py = 0.68 [loj . The LP convergence parameter was 
p LP = 0.001. 



B. Computational efficiency and scaling 
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FIG. 4: Log-log plot of elapsed computational time versus N 
for synthetic benchmarks. N was varied from 5, 000 to 20, 000 
in steps of 1, 500. The results shown for m = 2 (O) and 10 (□) 
are averages over five runs and are representative of those for 
2 < m < 10. (The standard errors of the mean are too small 
to be discernible.) The dotted line is the least-squares linear 
fit and has slope 1.8. 



To evaluate the efficiency and cost scaling of uncer- 
tainty minimization, we applied it to synthetic data 
sets containing from 2 to 10 clusters and from 5, 000 
to 20,000 items arranged in a pyramid of blocks in a 
two-dimensional dataspace. For these tests we used the 
Laplacian T defined by Eqs. (p} and the definitions of S 
and D v arisin g fr om the continuous dynamical interpre- 
tation of Ref. 

e -d%/2(d 2 ) 

S v = 32 (i*3) (24a) 

a ij 

(Ar)« = /V" 1 , (24b) 

where dij is the Euclidean distance between items i and 
j in the dataspace and dp i s a characteristic distance of 
the problem: 

N 

(4) = tf-'E&c (25a) 

i=l 

di < = min dij . (25b) 

These problems required up to four invocations of the 
LP solver, with the number increasing with m, but not 
evidently with N. The log-log plot in Fig. [4] shows 
that execution time was proportional to N 1 - 8 with lit- 
tle dependence on m. Execution time was dominated by 
the calculation of S and by the eigensolver (each having 
roughly equal cost), with uncertainty minimization con- 
tributing < 10% of the total in all problems tested. The 
largest problem (m = 10, TV = 20,000), which is of the 
scale of biological microarray gene expression data sets, 
only required about 30 seconds on a commodity proces- 
sor. 



C. General applicability 

Uncertainty minimization is applicable to spectral 
clustering using any T defined by Eqs. ([1]), including un- 
normalizcd and normalized forms that arc popular in the 
literature. Of course, the success of any method will 
depend on the choices of S and D^, which are highly 
problem-specific, and we do not address this issue here. 
Our goal was to demonstrate the applicability of uncer- 
tainty minimization to this wide range of formulations. 
Thus, in addition to the tests described above using the 
r of Eqs. (j24p . we applied uncertainty minimization to 
the spiral problem using two other forms of T. The first 
one, an asymmetrically normalized Laplacian T ( [52| for 
review) with Sy a Gaussian function of dij, commonly 
arises when a Markovian [l7l - [l9l . [53[ rather than a contin- 
uous flfj| dynamical interpretation is used. It is specified 
by Eqs. ^} with 

S tJ = e- d ^ /2a2 (26a) 

(Ar)i« = , (26b) 

where a is chosen by empirical tuning [l7l - [l9l . [53^ or 
heuristics HQ. We chose cr 2 = {d\). (This type of 
r, but with a non-Gaussian ^-also frequently arises in 
image segmentation 0, UBI . [l6l |55| where it is motivated 
by the "normalized cut" variant of the min-cut graph 
partitioning method 0.) We also tested the symmet- 
ric, unnormalized Laplacian form ( [56| for review) of T 
specified by 

Sij = e"4/ 2ff2 (27a) 
(D n ) u = N- 1 . (27b) 
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This form is popular in graph partitioning problems (e.g., 
VLSI circuit partitioning [52H59J, parallel matrix factor- 
ization [6(3], and computational load balancing (6lL Iffill). 
where it is used to approximate the solution to the 
"ratio cut" variant of the min-cut graph partitioning 
method When applied to graph partitioning the 

Sij are simply edge weights, but to apply it to the spiral 
data clustering problem the £y must be computed from 
the dij ; for this wc again used the Gaussian form of Eq. 
(|27a[) because it is popular in dataspace clustering Q. 

Fig. [5] shows the results obtained by using uncertainty 
minimization for fuzzy spectral clustering of the spi- 
ral problem with the T matrices defined by Eqs. 
(f26|) . and (|27|) . In each case the algorithm selected the 
same three representatives and the LP solver was invoked 
twice. While there were minor differences in the w a along 
the cluster boundaries, the use of all three F gave es- 
sentially the same results. In contrast [panel (d)], the 
spiral problem confounded fc-means with "extragrades" 
(which is an outlier- robust variant of fc-means) [€331 ] . As 
discussed in the Introduction, this failure of fc-means is 
not surprising, given the irregular, interlocking nature of 
the clusters. 



V. DISCUSSION 

To date, spectral clustering has been used primarily for 
hard partitioning. Prior studies [l(| [H| have suggested 
that fuzzy spectral clustering could be accomplished by 
using the m low-frequency eigenvectors of T as a linear 
basis for expanding, via a transformation matrix M, the 
fuzzy cluster assignment vectors w a , where w a (i) is the 
probability that item i is assigned to cluster a. Koren- 
blum and Shalloway [Io[ suggested that M* , the opti- 
mal M, is best identified by uncertainty minimization, 
which minimizes the probabilistic overlap between clus- 
ters. Uncertainty minimization has the additional ad- 
vantage of providing measures (the final values of the 
objective function and fractional cluster certainties) that 
quantify the quality of a clustering, which can be as im- 
portant as the clusterings themselves. However, Koren- 
blum and Shalloway did not provide an efficient means of 
solving this challenging non-convex global minimization 
problem, which limited their approach to small data sets 
with N ~ O(10 2 ) items. Alternatively, Weber et al. [3 
suggested that M could be determined by perturbativc 
approximation from almost-block-diagonal matrices, but 
this approach gives w a that only approximately satisfy 
the probabilistic constraints of Eqs. ([5]). Thus, until now 
there has been no computationally practical, exact fuzzy 
spectral data clustering method. 

To address this need we developed an efficient method 
for uncertainty minimization, which extends the num- 
ber of items that can be clustered by at least two orders 
of magnitude: data sets with ./V ~ O(10 4 ) can now be 
analyzed within ~ 30 seconds on a commodity proces- 



sor. Using tests with synthetic data sets having up to 
20, 000 items and ten clusters we showed that compu- 
tational cost scaled ~ 0(N 1 - 8 ) and was insensitive to 
the number of clusters. This implies that as many as 
N ~ O(10 6 ) items can be clustered in modest time on a 
serial machine. The additional cost of uncertainty min- 
imization was small compared to costs common to all 
spectral clustering methods (e.g., computing T from the 
dij and computing its low- frequency eigensystem). 

In developing this approach we elucidated the under- 
lying structure of the uncertainty minimization prob- 
lem. This revealed fundamental relationships be- 
tween four different geometric representations: the Tri- 
dimensional symmetric ill-representation, the m(m — 
l)-dimcnsional asymmetric M-rcprcscntation, and the 

(m — l)-dimensional u> A - and ijj ^-representations. All 
are formally equivalent, but each has advantages: The 
symmetric A/-rcprcsentation has the most direct con- 
nection to the minimization problem. The asymmet- 
ric M-representation provides a closed feasible region; 
it is used to prove that all local minima are at polytopc 
vertices and that the inequality constraints are evenly 
distributed between the clusters at these points. The 
w A -representation provides barycentric coordinates and 
makes it evident that the m cluster representatives in 1Z C 
are those items that determine the (m — l)-simplex of 

largest hypervolume. The ip ^-representation motivates 
the greedy algorithm used to approximate 1Z Cl which in 
turn yields M°, the starting point for refinement to M*. 

The greedy algorithm wc used is almost identical to 
the "inner simplex method" used in the Perron Cluster 
Cluster Analysis method (l8l . HH for approximate fuzzy 
data clustering [sBI ] . However, our motivation for the 
algorithm, and consequently our understanding of its do- 
main of validity, are different. The inner simplex method 
was motivated by earlier studies [661 l67l| on perturbation 
theory of block-diagonal matrices [34j . These studies ex- 
ploited two observations: (1) that the T of well-separated 
clusters can be brought into almost-block-diagonal form, 
and (2) that the low-frequency eigenvectors of such a 
r are perturbed only in second-order in the non-block- 
diagonal terms, and therefore, to this order, possess a 
"level structure" in which their components are concen- 
trated near m different values. The inner simplex method 
aims at finding one item from each level set and thus, 
in principle, depends on their existence. In contrast, the 
analysis presented here makes no assumptions about level 
structure and only presumes that at least one item (i.e., 
the representative) can be well-assigned to each cluster. 
An example where representatives exist, even though ma- 
trix perturbation theory is no longer applicable and the 
eigenvectors do not have a level structure, is illustrated 
in Fig. [SJ Even in this case it is evident that there are 
three fuzzy clusters, although many of the items will have 
weak assignments. Thus, the greedy algorithm is more 
generally applicable than previously stated. 

In the two-cluster case the M° solution is always ex- 
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(d) 



FIG. 5: Fuzzy spectral clustering by uncertainty minimization of the spiral problem using three different F matrices. The fuzzy 
clusterings and clustering eigenvectors, eigenvalues, and assignment vectors computed using the F matrices specified in (a) Eqs. 
(|24[) . (b) Eqs. (|26[1 . and (c) Eqs. (J27J) arc shown. Representatives are indicated by triangles, circles, and squares in the two left 
columns, (d) Application of fuzzy fc-means with extragrades to this problem; arrows identify misclassified items. 



13 




(a) (b) 

FIG. 6: Assignment vectors for a three-cluster problem for 
eigenvectors with or without a "level structure." Eigenvectors 
arising from almost-block-diagonal V have a level structure 
leading to "almost-hard" assignment vectors such as those 
shown in panel (a). (Different symbols are used for the three 
assignment vectors.) When there is no level structure, the 
assignment vectors are much softer, as in panel (b). However, 
even in this case representatives (identified by arrows) exist. 

act, but in the tested m > 2 problems, it always violated 
some of the inequality constraints required for a proba- 
bilistic interpretation of the w a . These violations were 
removed by refinement. The corrections changed w a (i) 
by < 0.05; so, except when high accuracy is needed, the 
most important role of the refinement may be to provide 
a rational method for ensuring that the satisfy the 
probabilistic constraints. 

Deuflhard and Weber [68| used a metastability ob- 
jective function for clustering protein conformations 
collected from molecular dynamics simulations that is 
closely related to the sum of the fractional cluster certain- 
ties T a (M) defined in Eq. ([7]). Their objective function 
is the sum of terms 



where t denotes a time period which, in practice, is set 
to a multiple of the molecular dynamics integration time 
step [H [70|. v a (M;t) measures the fractional persis- 
tence of probability within subregion a of conformation 
space after stochastic evolution for time t, and is identi- 
cal to the T Q (M) except for the presence of the Markov 
matrix e _r *, generated by a T derived from the molec- 
ular dynamics data. Thus, T a (M) is the t — > limit 
of v a (M;t). It is not clear if a i-dependent objective 
function is appropriate for clustering data that does not 
arise in a dynamic manner, though this may be worth 
considering. 

Another potentially interesting objective function is 
the determinant of M. It is intriguing because of its 
simple geometric interpretation: We can show that max- 
imizing \M\ is equivalent to maximizing the hypervolumc 
of the (to — l)-simplex formed in the ^^-representation 
by any subset of to items [7i| . This property is attrac- 
tive since we expect a good clustering to spread the items 
out in this barycentric representation as much as possi- 
ble. However, \M\ does not have a simple information- 
theoretic interpretation as does $(M), defined in Eq. 



(|T0|) : cxp[— $(M)] is the product of the fractional clus- 
ter certainties, T Q , which are normalized to unity when 
the corresponding cluster is completely hard, but \M\ 
does not provide a measure of cluster hardness. More- 
over, while optimization using either \M\ or Q(M) tends 
to minimize overlap, optimization of \M\ also tends to 
equalize the size of the clusters [7l|. Although this is 
not necessarily desirable for data clustering, it may be 
of value in graph partitioning applications that seek to 
balance partition sizes 0, [13] ■ 

Uncertainty minimization and the method for effi- 
ciently solving it presented here are applicable to the 
wide range of popular V matrices that satisfy Eqs. JT]). To 
demonstrate this, we applied uncertainty minimization to 
the spiral data set, a convenient two-dimensional exam- 
ple with visually-discernible irregularly-shaped clusters, 
using one asymmetric and two symmetric forms of T. 
The resulting fuzzy spectral clustering gave similar re- 
sults with all three T matrices, while fc-means did not 
provide a valid clustering. Of course, these particular 
forms may not be suitable for all data sets — as in hard 
spectral clustering, T must often be tailored to the prob- 
lem. Our goal here was to demonstrate the ability of un- 
certainty minimization to efficiently fuzzify spectral clus- 
tering methods. It can now be applied to a wide variety 
of problem-specific domains, such as those noted in the 
Introduction. 
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Appendix A: T Preconditioning 

Numerical errors in computing the eigensystem in- 
crease with 7at_i/7i and, if this ratio is too large, can 
obscure differences between very small eigenvalues and 
obfuscate the spectral gap. This can occur if two items 
within a cluster are exceedingly close (and hence commu- 
nicate very rapidly) or if clusters are nearly isolated (and 
hence communicate very slowly). The latter situation 
can also occur if the data contain outliers — items that 
are distant from the bulk of the items. We avoid these 
problems by preconditioning T and, at the same time, 
improve computational efficiency by sparsifying it (i.e., 
by setting very small transition rates exactly to zero). 
[This reduces memory requirements and improves cache 
performance and eigensolver efficiency so that MDC may 
be practically applied to large problems. For example, for 
the largest of the scaling benchmarks considered in Sec- 
tion V.B. (i.e., to = 10, N — 20,000), the sparsified V 
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matrix held less than 650,000 independent elements, rep- 
resenting a storage reduction of a factor of ~ 300.] This 
involves three steps: (1) determine appropriate upper 
(<ihi) and lower (d\ ) bounds on the dij, (2) sparsify T us- 
ing G?hi and check for any resultant graph disconnections, 
and (3) evaluate the remaining matrix elements and trun- 
cate the large-magnitude elements using d\ 0l compute 71, 
bound 7jv-i 5 and confirm that 7^-1/71 is properly con- 
strained. If it is not, di is increased so that it will be. 
(Increasing d\ was not required for the examples in this 
paper, but this step is included precaution.) 

To avoid excessive numerical error, we want to adjust 
r so that 



< a 



7i 



(Al) 



where A 7 is the expected computational error in the 
eigenvalues and a is the desired fractional precision, e.g., 
- 0(1O- 2 ). Typically H^HS 

A 7 < 6 7AT-1 , 

where e is machine precision. So Eq. ([AT]) will be satisfied 
if 



7iv-i 
7i 



< a/e 



(A2) 



We expect that 7iv-i/7i will depend on |Thi | / 1 Ti 1 , the 
ratio of the largest to the smallest non-zero |r^j|, and 
one way to satisfy Eq. (|A2j) would be to limit this ratio. 
However, when clustering data, e.g., as in the examples of 
this paper, computing the from the d^ constitutes a 
significant fraction of total cost because exponentiation 
is required [at least for forms of S in Eqs. ([24]) . (|26| . 
and ([27]) ] and this is wasted for the large fraction of the 
Ti^j that are zeroed during preconditioning. Therefore, 
instead of directly limiting |rhi|/|ri |, we gain the same 
result by limiting dhi/^io, the ratio of the largest to the 
smallest dij . This allows us to sparsify before evaluating 
all but a few matrix elements. This indirect approach 
is not needed when applying uncertainty minimization 
to spectral clustering of graphs where S is specified a 
priori and, hence, all elements of T can be inexpensively 
computed. 



However, when T is asymmetric [i.e., (D v )u 7^ 1/N as 
in Eq. (|26[)]. then even this requirement can not be im- 
posed until is evaluated, and this would require costly 
evaluation of all the prior to sparsification. Thus, 
instead we apply Eq. (|A3|) to V s : 



m 



= a/e 



(A4) 



We expect this to be adequate because in most cases 
multiplying by will result in |r hi |/|r lo | < |r£|/|r£|. 
(This indeed is the case for the examples we have con- 
sidered.) However, exceptional sets of dij can be con- 
structed where it will not, so this is not guaranteed. 
Nonetheless, we use Eq. (|A4|) because of its reduced cost 
and the guarantee that Eq. (| A2|) will ultimately be sat- 
isfied by the confirmation and possible iteration steps 
described in Sec. IA 31 

To minimize the effect of preconditioning on the rest 
of the eigensystem, we multiplicatively center |r^| and 
|r^| around |r^ id |, a typical midrange rate. That is, we 
require 



l^midl 



(A5) 



mid I 



We determine |r^ id | by noting that ir^J, the magnitude 
of the largest rjLy in row i, is the largest transition rate 
connecting i to other items. Thus, the median of the 
[r^l is a reasonable choice for |r^ id |. Because |r£y| 
depends monotonically on dij [e.g., see Eqs. (|2"4"|). (|2l)j) . 
and (|27])]. this is equivalent to \T^ id \ = |r 5 (med{d,<})|, 
where med{<ii<} is the median of the {<ii<}, the smallest 
off-diagonal elements in each row of the d^ matrix. Thus, 
determining |r^ id | requires computing only one element 
of V s . Once this has been done, Eqs. 
be combined to give 



rf = |r 
if. = |r 



mid 



mid 




We then numerically invert Tfj(dij) 
Eqs. (|23J, (lia, or ([27j 



U) and (|A5|) can 

(A6a) 
(A6b) 

[e.g., using one of 



with rf, 



and 



r?j| to determine c?hi and d\ 0: respectively. 



1. Determining dhi and d\ 

Although a rigorous a priori bound on 7^-1/71 de- 
pends on N as well as on |rhi|/|ri |, we expect that in 
most cases the two ratios will be roughly of the same 
order-of-magnitudc since |r m | and | Tio | set the scales of 
the fastest and slowest dynamical processes in the sys- 
tem [72| . Thus, we can hope to satisfy Eq. (|A2j) by re- 
quiring that 

|54 = a/e (not used) . (A3) 

I 1 lo| 



2. Sparsification and connected component analysis 

V s is sparsificd by setting all off-diagonal elements hav- 
ing magnitudes less than T^ to zero. That is, 

r£y -> (if dij > d U ) ■ 

To test if this disconnects the graph, we perform a 
standard connected component analysis (73l |. This ini- 
tially assigns items to individual sets and then iteratively 
merges sets whenever any of their respective members are 
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connected. If distinct subsets (i.e., disconnected com- 
ponents) remain at the end, the algorithm creates hard 
assignment vectors identifying them. (This process may 
remove outliers.) Larger subsets may be analyzed as new 
clustering problems of their own. 



3. Truncation and checking the eigenvalue range 



those vertices where at least one of the M* = (since 
— > 

multiplying such an M * by £ Q has no effect). This proof 
does not preclude the possibility that a single item may 
be associated with multiple active constraints; i.e., it is 
possible that w a (i) = and Wp(i) = are both active 
constraints. {This is the case for the solution to the spiral 
problem [Figs. [3jc) and (d)] where w^ri.) = = u^ri) 
and also wifa) = = wafo)-} 



Having sparsihed the (typically large) fraction of in- 
significantly small off-diagonal elements, we now evalu- 
ate the remaining Tf^j while truncating their maximum 
magnitudes using 

rf^-iiti (if<%<o, 

and compute f T and T. We can then compute 71 using 
a Lanczos solver (see Sec. lIIII) an d bound jn-i using the 
Gershgorin Circle Theorem [74| and Eqs. (fTc|) and ()ld[) 
to 

7;v-i < 2max|r ii | , (A7) 

If 7! and the Gershgorin bound on 7at_i satisfy Eq. (|A2|) . 
then preconditioning is complete. If not, d\ , and hence 
|Fy|, is adjusted so that it will be satisfied when the 
|r£^j I are truncated to the new bound and T is recom- 
puted. Preconditioning is now complete. 



Appendix B: Various proofs 

1. Even distribution of active inequality constraints 

We prove here that each cluster must be constrained 
by exactly m — 1 inequality constraints at each local min- 
imum of $ in the feasible region. Consider a local min- 
imum M xfrcc in the asymmetric M-representation dis- 
cussed in Sec. Ill Al Korenblum and Shalloway [l(| have 
already proved that this must be at a vertex of the fea- 
sible polytope. The coordinates at the local minimum of 
the individual free particles, M* (1 < a < m), satisfy 
the inequality constraints of Eq. (|llap . but their homo- 
geneity means that they will also be satisfied for any 
(,aM a with £ Q > 0. Thus, the free particle inequality 
constraints acting alone leave the m — 1 degrees of free- 
dom £ Q unspecified and are inadequate to force Al xfree 
to be at a vertex of the feasible polytope. Therefore, at 
least m — 1 additional active constraints must come from 
the inhomogeneous inequality constraints associated with 
the slave particle [Eq. ([13])]. However, the choice of the 
slave particle in Eq. ([12]) is arbitrary. Therefore, every 
particle must have at least to — 1 active inequality con- 
straints. But since only m(m — 1) inequality constraints 
are active at a vertex, each of the to particles must have 
exactly to — 1 inequality constraints active. This proof 
extends to every vertex of the feasible polytope except for 



2. Invertibility of *^ 

We prove here that there is at least one subset of m 
items 1Z such that ^ n is invertible. We define the m x N 
matrix ^ by = ifj n {i) (0 < n < m; 1 < i < N). Since 
its to rows (i.e., the low- frequency eigenvectors) are lin- 
early independent, has rank m. Therefore, '5 also has 
at least to linearly-independent columns. If the items cor- 
responding to these columns are selected to comprise 1Z, 
then the to x to matrix has full rank and is therefore 
invertible. 



3. Invertibility of M 

We prove here that each M corresponding to a local 
minimum of $ within the feasible region is invertible. As 
proved in Appendix IB II at any such minimum each clus- 
ter has to — 1 active inequality constraints: to — 1 items 
lie on each of the to faces of the bounding simplex in the 
^^-representation. Consider a subset 1Z that contains 
one item from each face. It defines an (to — l)-simplex 
(inscribed within or identical to the bounding simplex) 
with non-zero hypervolume. This hypervolume is pro- 
portional to implying that \W n \ 7^ and, with 
Eq. ([17]). implying that \M\ ^ 0. Thus, M is invertible. 

Actually, the proof holds for every M having all M a 7^ 
that lies at a vertex of the feasible polytope in the 
asymmetric M-representation since Appcndix lB ll applics 
to all such M, not only those at local minima. 



4. The bounding simplex in the ip -representation 

Analogously to Eq. ([T4|) . we may write 

W vort = M o * vert , (Bl) 

where the columns of <I' vert are the coordinates of the 
bounding simplex vertices in the low-frequency eigenvec- 
tor representation and W vclt is the matrix whose rows are 
the coordinates of the vertices in the w A -representation; 
i.e., W VCIt = I. Inver ting this gives * vcrt = M _1 . When 
M = M°, Eqs. (fi"9bl) and ((BTJ imply that * vort = 
which is consistent with the zeroth-order placement of 
the representatives at the vertices. When M = M* , the 
vertices may not correspond to item locations, but, as 
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proved in Appendix IB 31 M* is invertible, so vl/ vcrt = 
(M*) _1 . In both cases, the simplex vertex coordinates 

in the ip ^-representation are given by the columns of 
^vcrt w it n th e gj. s ^ row om itted. (Just as for all el- 
ements in the first row of vI/ vort are one for any invertible 
M, in particular, for M° and M* [75[.) 



to the distance of point % from the side opposite to v 2 
in the ip ^-representation. Combining all these propor- 
tionalities proves that the distance of point i from the 
side opposing a vertex in the w) A -representation is pro- 
portional to its distance in the ip ^-representation. 
Appendix C: Greedy algorithm for selecting 1Z 



Same ordering of item points in the w 
ip ^representations 



and 



To simplify the proof of identical ordering, we use the 
spiral problem illustrated in Fig. [3] as a specific example; 
the proof is easily generalized. We index the vertices in 
the u) A -representation as described in Sec. Ill B 21 For 
example, the top vertex in panel (c) is vertex 3 and we 
denote it as v%. We carry the same indexing over to the 
ip ^-representation. 

Ordering the items according to their distances from 
the simplex faces is easy in the u) A -rcprcscntation: Be- 
cause it provides barycentric coordinates, the distance of 
a point i from the side opposite vertex a is just w a (i), 
with the sign negative if the point lies outside the sim- 
plex. The w A ordering can be related to the ip ± ordering 
in a few steps. First, note that the distance of point i 
from the side opposite i>2 is linearly related to the area of 
the triangle having vertices at points i, vx, and t>3, with 
sign depending on triangle orientation. This signed area 
is proportional to the ratio of determinants 



-4 



\w(i) <g) £i <g) £ 3 | 
\S2 ® S\ <g> £3 I 



where w (i ) <g> i\ <g> £3 is the 3x3 matrix formed by stacking 
the three row vectors and the denominator (which will 
always be ±1) ensures the correct sign. Second, note 
that since 

£2 <E> £1 <E> £3 = hi o [ip V2 (gi ip Vl <g) ip Vs ] , 



The goal of the algorithm is to choose the subset 
of items 1Z that approximately defines the (to — 1)- 
simplcx having maximum hypervolume V m -i in the ip - 
representation. If the hypervolume, V m -2, of one face of 
the simplex is already determined, V m -\ is proportional 
to the distance of the excluded vertex from that face. 
[For example, in the case of a 2-simplex (a triangle), this 
is the familiar area — 1/2 base x height rule, where 
"base" is the length of the determined simplex face and 
"height" is the distance of the other point from that face.] 
This suggests a natural greedy algorithm: (a) initialize 
by finding the (q — 1 = l)-simplcx of greatest length, (b) 
extend the (q — l)-simplcx to a q-simplex by finding the 
item that is furthest from the hypersurface that embeds 
the (q — l)-simplcx, (c) q q + 1 and return to step (b) 
until q = m. 

Specifically, 



i\ and ii that maximize 



1. Initialize: 

Select the two items 

||^(i 2 )-^(ii)||. 

K = {«1,«2}- 
q = 2. 



2. Repeat while q < m: 

(a) Select the item i q +\ that maximizes 

d ± (i g+ i) - - ~$Hii)]\\ , 

where 



v q =1 



q'=2 

{WjlZ^llU i q+1 
(c) q i- q + 1 



U ± (i q >)-^ ± (H)\\ 2 



where ip Vk is the to- vector having the coordinates of ver- 
tex Vk in the low-frequency eigenvector space, 



A = 



Y$(i) g ~Vv 1 ® ~$v 3 \ 

I Ip V 2 ® 1p Vl ® 1Pv 3 \ 



(B2) 



Third, since all the m- vectors in Eq. (|B2j) have their ze- 
roth component equal to one, A is proportional to the 
signed area of the triangle having vertices i, vi, and V3 in 
the ip ^-representation. Fourth, this area is proportional 



Here ® denotes the inner product within the (to — 1) 



dimensional ip space and V q is the projection matrix in 



this space that removes the components of [ip — 

ip (*i)] that lie within the subspace containing the 
(q — l)-simplcx. Therefore, d ± (i q+ i) is the distance of 



— »• 



ip {iq+i) from the subspace, and the g-simplex formed 



by adding ip (i q +i) as a vertex is that of maximum hy- 
pervolume containing the previously computed (q — 1)- 
simplex as one of its faces. 
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